Commentz-walter: Any Better than Aho- Corasick for Peptide Identification?
نویسندگان
چکیده
An algorithm for locating all occurrences of a finite number of keywords in an arbitrary string, also known as multiple strings matching, is commonly required in information retrieval (such as sequence analysis, evolutionary biological studies, gene/protein identification and network intrusion detection) and text editing applications. Although Aho-Corasick was one of the commonly used exact multiple strings matching algorithm, Commentz-Walter has been introduced as a better alternative in the recent past. Comments-Walter algorithm combines ideas from both Aho-Corasick and Boyer Moore. Large scale rapid and accurate peptide identification is critical in computational proteomics. In this paper, we have critically analyzed the time complexity of Aho-Corasick and Commentz-Walter for their suitability in large scale peptide identification. According to the results we obtained for our dataset, we conclude that Aho-Corasick is performing better than Commentz-Walter as opposed to the common beliefs.
منابع مشابه
A Boyer-Moore-style algorithm for regular expression pattern matching
Richard E. Watson Dept. of Mathematics Simon Fraser University Burnaby B.C., Canada watsona@sfu. ca This paper presents a Boyer-Moore type algorithm for regular expression pattern matching, answering an open problem posed by A. V. Aho in 1980 [Aho80, p. 3421. The new algorithm handles patterns specified by regular expressions a generalization of the Boyer-Moore and Commentz-Walter algorithms (w...
متن کاملMultiple Pattern String Matching Methodologies: A Comparative Analysis
String matching algorithms in software applications like virus scanners (anti-virus) or intrusion detection systems is stressed for improving data security over the internet. String-matching techniques are used for sequence analysis, gene finding, evolutionary biology studies and analysis of protein expression. Other fields such as Music Technology, Computational Linguistics, Artificial Intelli...
متن کاملSPARE Parts: A C++ Toolkit for String PAttern REcognition
In this paper, we consider the design and implementation of SPARE Parts, a C++ toolkit for pattern matching. SPARE Parts is the second generation string pattern matching toolkit from the Ribbit Software Systems Inc. and the Eindhoven University of Technology. The toolkit contains implementations of the well-known Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick and Commentz-Walter algorithms (and ...
متن کاملTo Use or Not to Use: Graphics Processing Units for Pattern Matching Algorithms
String matching is an important part in today’s computer applications and Aho-Corasick algorithm is one of the main string matching algorithms used to accomplish this. This paper discusses that when can the GPUs be used for string matching applications using the Aho-Corasick algorithm as a benchmark. We have to identify the best unit to run our string matching algorithm according to the perform...
متن کاملEfficiency of AC-Machine and SNFA in Practical String Matching
A note on practical experience with on Aho-Corasick-machine and SNFA (Searching NFA) estimating the construction aspects and run cost. It is shown that SNFA can be more practical than AC-machine.
متن کامل